In the U.S. and Canada, breast cancer is the most diagnosed cancer among women accounting for almost 25% of all the cancer cases. It is one of the leading causes of cancer-related deaths in the U.S. In this project, we will use the Original Wisconsin Breast Cancer dataset and analyze it. Goal of the project is to make machine learning models to predict the class of the tumor and its degree of malignancy. Research Question: To create a machine learning model for the breast cancer dataset predicting the class of the tumor as well as the its degree of malignancy using various ML models. This project aims to speed up the diagnosis of breast cancer by utilizing ML to predict whether a breast tumor is benign or malignant. Based on its degree of malignancy, tumors can further be classified and appropriate future treatments can be planned depending on the associated progression rate.
Original Dataset Link: https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original
#New
# #ML
# install.packages("GGally")
# install.packages("ggpubr")
# install.packages("car")
#
# install.packages("randomForest")
# install.packages("pROC")
# install.packages("plotROC")
# install.packages("randomForest")
# install.packages("DescTools")
#
# #Old
# install.packages("tidyverse")
# install.packages("cowplot")
# install.packages("DT")
# install.packages("ggplot2")
# install.packages("formattable")
# install.packages("magrittr")
#Activation
library(ggpubr)
## Loading required package: ggplot2
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(car)
## Loading required package: carData
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(plotROC)
##
## Attaching package: 'plotROC'
## The following object is masked from 'package:pROC':
##
## ggroc
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cowplot)
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
##
## The following object is masked from 'package:ggpubr':
##
## get_legend
library(DT)
library(ggplot2)
library(formattable)
library(corrplot)
## corrplot 0.92 loaded
library(magrittr) #for pipe compatible custom function
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(DescTools) #for %][% filtering
##
## Attaching package: 'DescTools'
##
## The following object is masked from 'package:car':
##
## Recode
data <- read.csv("./breast+cancer+wisconsin+original/breast-cancer-wisconsin.data", sep=",")
#checking class value distribution (Benign/Malignant)
data %>%
select(X2.1) %>%
group_by(X2.1) %>%
count()
## # A tibble: 2 × 2
## # Groups: X2.1 [2]
## X2.1 n
## <int> <int>
## 1 2 457
## 2 4 241
#2: benign, 4: malignant
For all the variables, lower values indicate more benign properties and 10 shows the most malignancy.
breast_cancer <- data %>%
rename( sampleCodeNumber =X1000025, clumpThickness = X5, uniformityCellSize = X1, uniformityCellShape = X1.1, marginalAdhesion = X1.2, singleEpiCellSize = X2, bareNuclei = X1.3, blandChromatin = X3, normalNucleoli=X1.4, mitoses=X1.5, class=X2.1 )
head(breast_cancer)
## sampleCodeNumber clumpThickness uniformityCellSize uniformityCellShape
## 1 1002945 5 4 4
## 2 1015425 3 1 1
## 3 1016277 6 8 8
## 4 1017023 4 1 1
## 5 1017122 8 10 10
## 6 1018099 1 1 1
## marginalAdhesion singleEpiCellSize bareNuclei blandChromatin normalNucleoli
## 1 5 7 10 3 2
## 2 1 2 2 3 1
## 3 1 3 4 3 7
## 4 3 2 1 3 1
## 5 8 7 10 9 7
## 6 1 2 10 3 1
## mitoses class
## 1 1 2
## 2 1 2
## 3 1 2
## 4 1 2
## 5 1 4
## 6 1 2
summary(breast_cancer)
## sampleCodeNumber clumpThickness uniformityCellSize uniformityCellShape
## Min. : 61634 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 870258 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 1171710 Median : 4.000 Median : 1.000 Median : 1.000
## Mean : 1071807 Mean : 4.417 Mean : 3.138 Mean : 3.211
## 3rd Qu.: 1238354 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. :13454352 Max. :10.000 Max. :10.000 Max. :10.000
## marginalAdhesion singleEpiCellSize bareNuclei blandChromatin
## Min. : 1.000 Min. : 1.000 Length:698 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 2.000 Class :character 1st Qu.: 2.000
## Median : 1.000 Median : 2.000 Mode :character Median : 3.000
## Mean : 2.809 Mean : 3.218 Mean : 3.438
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 Max. :10.000
## normalNucleoli mitoses class
## Min. : 1.00 Min. : 1.00 Min. :2.000
## 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.:2.000
## Median : 1.00 Median : 1.00 Median :2.000
## Mean : 2.87 Mean : 1.59 Mean :2.691
## 3rd Qu.: 4.00 3rd Qu.: 1.00 3rd Qu.:4.000
## Max. :10.00 Max. :10.00 Max. :4.000
#type of attributes
str(breast_cancer)
## 'data.frame': 698 obs. of 11 variables:
## $ sampleCodeNumber : int 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 1035283 ...
## $ clumpThickness : int 5 3 6 4 8 1 2 2 4 1 ...
## $ uniformityCellSize : int 4 1 8 1 10 1 1 1 2 1 ...
## $ uniformityCellShape: int 4 1 8 1 10 1 2 1 1 1 ...
## $ marginalAdhesion : int 5 1 1 3 8 1 1 1 1 1 ...
## $ singleEpiCellSize : int 7 2 3 2 7 2 2 2 2 1 ...
## $ bareNuclei : chr "10" "2" "4" "1" ...
## $ blandChromatin : int 3 3 3 3 9 3 3 1 2 3 ...
## $ normalNucleoli : int 2 1 7 1 7 1 1 1 1 1 ...
## $ mitoses : int 1 1 1 1 1 1 1 5 1 1 ...
## $ class : int 2 2 2 2 4 2 2 2 2 2 ...
colnames(breast_cancer)
## [1] "sampleCodeNumber" "clumpThickness" "uniformityCellSize"
## [4] "uniformityCellShape" "marginalAdhesion" "singleEpiCellSize"
## [7] "bareNuclei" "blandChromatin" "normalNucleoli"
## [10] "mitoses" "class"
#Find missing values and plot to see the proportions w.r.t all the valid values of each variable
plotMissingValues <- function(value_df, figNum, showPlot=T){
if(showPlot){
missing <- apply(value_df, 2, function(x){sum(x=='?')})
value_df <- data.frame(colNames=names(missing), missing = missing[] )
row.names(value_df) <- NULL #remove rownames
}
p2 <- value_df %>%
filter_all(all_vars(.>0)) %>%
#sort the names column by missing values, for ggplot
mutate(colNames = reorder(colNames, missing)) %>%
pivot_longer(c(missing), values_to = "Values", names_to = "ValueType") %>%
ggplot(aes(x=colNames, y=Values, fill="red"))+
geom_bar(position="stack", stat="identity", show.legend = F)+
geom_text(aes(label = Values), nudge_y = 1)+
labs( x = "Attributes", y="#Missing Values")+
scale_fill_brewer(palette = "Set1")+
ggtitle(paste("Fig", figNum, ". Missing values"))
if(!showPlot){
return(
p2+
coord_flip()+
theme(axis.text.x = element_text(vjust = 0.5, hjust=1))
)
}else{
p2+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
}
}
#Plots all values, and shows missing and valid values as proportions
plotAllValues <- function(data, figNum){
missing <- apply(data, 2, function(x){sum(x=='?')})
valid <- apply(data, 2, function(x){sum(x!='?')})
value_df <- data.frame(colNames=names(missing), missing = missing[], valid = valid[] )
row.names(value_df) <- NULL #remove rownames
p1 <- value_df %>%
#sort the names column by missing values, for ggplot
mutate(colNames = reorder(colNames, missing)) %>%
pivot_longer(c(missing, valid), values_to = "Values", names_to = "ValueType") %>%
ggplot(aes(x=colNames, y=Values, fill=ValueType))+
geom_bar(position="stack", stat="identity")+
coord_flip()+
labs(fill="Value Type", x = "Attributes", y="#Values")+
scale_fill_brewer(palette = "Set1")+
ggtitle(paste("Fig", figNum, ". Proportion of missing values"))
p2 <- plotMissingValues(value_df, figNum+1, FALSE)
ggarrange( p1, p2, ncol=1, nrow = 2, heights = c(120, 120), align="v")
}
Let us examine missing and zero values
plotAllValues(breast_cancer, 1)
## Warning: Graphs cannot be vertically aligned unless the axis parameter is set.
## Placing graphs unaligned.
#Using this plot we can see that only bareNuclei has 16 missing values
apply(breast_cancer, 2, function(x){sum(x=='?')})
## sampleCodeNumber clumpThickness uniformityCellSize uniformityCellShape
## 0 0 0 0
## marginalAdhesion singleEpiCellSize bareNuclei blandChromatin
## 0 0 16 0
## normalNucleoli mitoses class
## 0 0 0
#bareNuclei has 16 missing values which we must remove for our ML analyses
#Remove missing values
breast_cancer = subset(breast_cancer, bareNuclei !='?')
#Validation
breast_cancer %>% filter(bareNuclei == '?')
## [1] sampleCodeNumber clumpThickness uniformityCellSize
## [4] uniformityCellShape marginalAdhesion singleEpiCellSize
## [7] bareNuclei blandChromatin normalNucleoli
## [10] mitoses class
## <0 rows> (or 0-length row.names)
#No zero values
apply(breast_cancer, 2, function(x){sum(x==0)})
## sampleCodeNumber clumpThickness uniformityCellSize uniformityCellShape
## 0 0 0 0
## marginalAdhesion singleEpiCellSize bareNuclei blandChromatin
## 0 0 0 0
## normalNucleoli mitoses class
## 0 0 0
Bare nuclei is char, but has int values, so let us convert it int.
breast_cancer$bareNuclei = as.integer(breast_cancer$bareNuclei)
unique(breast_cancer$class) #confirming that there are only 2 unique class values
## [1] 2 4
breast_cancer$class = factor(breast_cancer$class,levels = c('2', '4'), labels = c('B', 'M'))
levels(breast_cancer$class)
## [1] "B" "M"
# head(breast_cancer)
#plots the distribution of variables in both bar and boxplot, colored by the class of the tumor
plotDistributionWithClass <-function(db, target, xTitle, figNum){
p1 <- db %>%
ggplot(aes(x=.data[[target]], fill=class))+
geom_bar()+
labs(subtitle=paste0("Fig. ", figNum, ". Distribution of ", tolower(xTitle), " w.r.t class"),
x = xTitle,
fill="Classification")
#boxplot
p2 <- ggplot(db, aes(y= .data[[target]], x = class, fill=class))+
geom_boxplot(show.legend = F)+
labs(subtitle=paste0("Fig. ", figNum+1, ". Classification w.r.t ", tolower(xTitle)),
x = "Classification",
y= xTitle)
ggarrange(p1, p2, nrow=1, ncol=2, align="h", heights = c(1,1), common.legend = T, legend = "bottom")
}
#Boxplot without any class segregation
plotBox <-function(db, target, yTitle, figNum=""){
ggplot(db, aes(y= .data[[target]]))+
geom_boxplot()+
show_mean_crossbar+
labs(subtitle=ifelse(figNum!="", paste("Fig", figNum, ". Distribution of", tolower(yTitle)), ""),
y= yTitle)
}
#with class
plotDensity <- function(db, target, xTitle, figNum){
db %>%
ggplot(aes(x=.data[[target]], fill=class))+
geom_density(position="stack")+
labs(
subtitle = paste("Fig", figNum, ". Density distribution of", tolower(xTitle)),
x = xTitle,
fill="Classification")
}
plotDistributionNoClass<- function(db, target, xTitle, figNum){
p1 <- plotBox(db, target, xTitle, figNum)
p2 <- plotDensity(db, target, xTitle, figNum+1)
ggarrange(p1, p2, nrow=1, ncol=2, common.legend = T, legend="bottom")
}
#generic function to transform the column by passing the type of transformation
transformCol <- function(db, orig, newColName, func){
return (db %>%
mutate({{newColName}} := round(func(get(orig)), 2)) %>%
select(-{{orig}})
)
}
#creates a {hello: "hello"} mapping for each column to avoid typing errors when referring to column names
createListColnames <- function(db){
columnNames = colnames(db)
cNames <- setNames(as.list(columnNames), columnNames)
return(cNames)
}
cNames <- createListColnames(breast_cancer)
# cNames$clumpThickness #check
#shows mean crossbar on a box plot
show_mean_crossbar <- stat_summary(fun=mean, aes(x=0), geom="point", shape=3, size=3, col="red")
all_cols <- breast_cancer %>%
pivot_longer(c(-class,-sampleCodeNumber), names_to="Features", values_to = "Values")
#with class
all_cols %>%
ggplot(aes(x=class, y=Values, fill=class))+
geom_boxplot()+
facet_wrap(~Features, scales="free")+
labs(subtitle="Fig. 3. Distribution curves for all attributes w.r.t tumor classification")
#since boxplot shows lot of lower end values in benign tumors, let us see how many lower end values are present in our dataset for the attributes which have many outliers like bareNuclei, marginalAdhesion,normalNucleoli, singleEpiCellSize, uniformityCellSize
breast_cancer %>%
filter(class=='B') %>%
select(bareNuclei, marginalAdhesion, normalNucleoli, singleEpiCellSize, uniformityCellSize) %>%
apply(2, function(x){sum(x <=5)}) #values lower than 5, almost ~400 for these columns
## bareNuclei marginalAdhesion normalNucleoli singleEpiCellSize
## 437 439 434 437
## uniformityCellSize
## 440
Distributions show the contributing variables which have significantly different median in benign and malignant tumors. There are a lot of outliers in benign tumor class. Let us look at the distribution of the data as a whole, without any class segregation.
#Plots for all values, without class differentiation
#density plot to see the shape of the distribution
all_cols %>%
ggplot(aes(x=Values))+
geom_density()+
facet_wrap(~Features, scales="free")+
labs(subtitle="Fig. 4. Distribution curves for all attributes")
#bar
all_cols %>%
ggplot(aes(x=Values))+
geom_bar()+
facet_wrap(~Features, scales="free")+
labs(subtitle="Fig. 5. Count distribution for all attributes")
#boxplot distribution see range, min, max and mean of values
all_cols %>%
ggplot(aes(y=Values))+
geom_boxplot()+
stat_summary(fun=mean, aes(x=0), geom="point", shape=3, size=3, col="red")+
facet_wrap(~Features, scales="free")+
labs(subtitle="Fig. 6. Range distribution for all attributes")
As we can see in the Figure 3., there are a lot of outliers in all the columns except for bland chromatin and clump thickness but that is only for benign tumors. When looking at the data closely for only benign tumors, it shows that since these samples are benign tumors, most of these have lower values (<5) which is justified, and hence other values appear as outliers. When looking at Figures 4-6, overall distribution of values with no class segregation, we see that only mitoses column has considerable number of outliers. We will handle this when it we do ML analysis so as not to skew our model predictions due to these outliers. Let us sort mitoses values to further examine its outliers
breast_cancer %>%
select(mitoses) %>%
arrange(mitoses) %>%
count(mitoses)
## mitoses n
## 1 1 562
## 2 2 35
## 3 3 33
## 4 4 12
## 5 5 6
## 6 6 3
## 7 7 9
## 8 8 8
## 9 10 14
This shows that values of 5-8 are outliers as there are very few data points with those values compared to the other values. We will keep them in for the purpose of exploration.
checkNormality <- function(.data, variable, figNum){
dataCol <- .data[[variable]]
#shapiro test
sp<-shapiro.test(dataCol)
paste('Raw data', sp)
#density graph
p1 <- ggplot(.data, aes_string(x=dataCol)) +
geom_density()
#qq plot
p2 <- ggplot(.data, aes_string(sample=dataCol))+
stat_qq() +
stat_qq_line() +
labs(x ="theoretical",
subtitle=paste("Original: Shapiro:", round(sp$p.value, 3)))
#Log transformation
#shapiro test
sp<-shapiro.test(log2(dataCol))
#density graph
p3 <- .data %>%
mutate(logVar = log2(get(variable))) %>%
ggplot(aes(x=logVar)) +
geom_density()
#qq plot
p4 <- .data %>%
mutate(logVar = log2(get(variable))) %>%
ggplot(aes(sample=logVar))+
stat_qq() +
stat_qq_line() +
labs(x ="theoretical",
subtitle=paste("Log2 transformed: Shapiro:", round(sp$p.value, 3)))
#sqrt transformation
sp<-shapiro.test(sqrt(dataCol))
#density graph
p5 <- .data %>%
mutate(sqrtVar = sqrt(get(variable))) %>%
ggplot(aes(x=sqrtVar)) +
geom_density()
#qq plot
p6 <- .data %>%
mutate(sqrtVar = sqrt(get(variable))) %>%
ggplot(aes(sample=sqrtVar))+
stat_qq() +
stat_qq_line() +
labs(x ="theoretical",
subtitle=paste("Sqrt transformed: Shapiro:", round(sp$p.value, 3)))
plot <- ggarrange(p1, p2, p3, p4, p5, p6, nrow=3, ncol=2)
annotate_figure(plot, top=text_grob(paste0("Fig. ", figNum,". ", variable)))
}
#let us do the normality checks for the rest of the features
breast_cancer %>%
checkNormality(cNames$clumpThickness, 7)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
breast_cancer %>%
checkNormality(cNames$marginalAdhesion, 8)
breast_cancer %>%
checkNormality(cNames$uniformityCellSize, 9)
breast_cancer %>%
checkNormality(cNames$uniformityCellShape, 10)
breast_cancer %>%
checkNormality(cNames$singleEpiCellSize, 11)
breast_cancer %>%
checkNormality(cNames$bareNuclei, 12)
breast_cancer %>%
checkNormality(cNames$blandChromatin, 13)
breast_cancer %>%
checkNormality(cNames$normalNucleoli, 14)
breast_cancer %>%
checkNormality(cNames$mitoses, 15)
Looking at the distribution of various columns, we can see that most of our column data values are not normally distributed. We will keep this aspect in mind when designing ML models. We also see clump thickness looks better after transformations. So, let us transform these columns. For single epithelial cell size, even though original distribution is shifted more to the center, it looks like a normal distribution with a tail, so we will keep it untransformed.
#For clump thickness, even though shapiro test says that sqrt() distribution is not normal, density plot and qqplot show it to be fairly normal. Since shapiro test is highly sensitive, we will ignore shapiro test and proceed with transforming the column with sqrt
#So let us do a square root transformation of clump thickness
bc_transformed <- transformCol(breast_cancer, cNames$clumpThickness, 'clumpThickness_sq', sqrt)
#For single epithelial cell size, it looks like a normal distribution with a tail, so we will keep it as is.
cNames <- createListColnames(bc_transformed) #update the list
Let us look at a few more features of clump thickness in detail.
plotDistributionWithClass(bc_transformed, cNames$clumpThickness_sq, "Clump Thickness", 16)
plotDistributionNoClass(bc_transformed, cNames$clumpThickness_sq, "Clump Thickness", 17)
#Let us do a stat test to confirm the observation in the box plot
t.test(clumpThickness_sq~class, data = bc_transformed) #p-value < 2.2e-16
##
## Welch Two Sample t-test
##
## data: clumpThickness_sq by class
## t = -24.677, df = 496.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group B and group M is not equal to 0
## 95 percent confidence interval:
## -1.0699283 -0.9121214
## sample estimates:
## mean in group B mean in group M
## 1.644289 2.635314
As we can see from the bar and density plots in Fig 16 and 18., more malignant tumors have higher values of clump thickness whereas more benign tumors have lower values of clump thickness. Looking at the box plot in Fig 17., we can see there is an appreciable amount of difference in the mean value of clump thickness for both the classes of tumors. With a significant p-value in t-test, we can say there is a significant difference in clump thickness values of Benign vs Malignant tumors and hence clump thickness could be a potential contributor to detecting malignant cancers.
name = "Marginal Adhesion"
plotDistributionWithClass(bc_transformed, cNames$marginalAdhesion, name, 19)
plotDistributionNoClass(bc_transformed, cNames$marginalAdhesion, name, 20)
#Let us do a stat test to confirm the observation in the box plot
t.test(marginalAdhesion~class, data = bc_transformed) #p-value < 2.2e-16
##
## Welch Two Sample t-test
##
## data: marginalAdhesion by class
## t = -20.055, df = 259.37, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group B and group M is not equal to 0
## 95 percent confidence interval:
## -4.654271 -3.822018
## sample estimates:
## mean in group B mean in group M
## 1.347630 5.585774
Figures 19-21 show that most of the samples have low adhesion values. We can see a clear distinction in the boxplot of Fig 20. that malignant cancers seem to have a higher mean value of adhesion compared to benign tumors. Also, in the boxplot there are quite a few outliers in the marginal adhesion values and those are mostly for benign tumor samples. From both bar and density plots we can see that malignant tumors are distributed throughout the entire range of values, whereas benign tumors mostly have very low marginal adhesion values. Note: Even though it looks like it opposite to the expected trend of marginal adhesion, but in this dataset, the lower values of each attribute indicate benign properties and higher value tends towards malignancy, and hence the distribution makes sense. Let us see sort these values to analyze them further
bc_transformed %>%
select(marginalAdhesion) %>%
arrange(marginalAdhesion) %>%
count(marginalAdhesion)
## marginalAdhesion n
## 1 1 392
## 2 2 58
## 3 3 58
## 4 4 33
## 5 5 23
## 6 6 21
## 7 7 13
## 8 8 25
## 9 9 4
## 10 10 55
So maximum values are 1 for 392 samples, and the distribution looks fairly even for the rest of the samples. So we cannot say that there are certain outliers which are very far from the mean value. Outliers show up in the graph because almost 400 out of 700 samples having a single value, but that is expected since 65% of the tumors are benign in our dataset, so majority of the attribute values will be on the lower end. This also indicates that our ML model might be more biased towards benign tumors in its prediction.
name = "Uniformity of cell shape"
plotDistributionWithClass(bc_transformed, cNames$uniformityCellShape, name, 21)
plotDistributionNoClass(bc_transformed, cNames$uniformityCellShape, name, 22)
t.test(uniformityCellShape~class, data=bc_transformed) #significant, p-value < 2.2e-16
##
## Welch Two Sample t-test
##
## data: uniformityCellShape by class
## t = -29.862, df = 274.21, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group B and group M is not equal to 0
## 95 percent confidence interval:
## -5.484520 -4.806119
## sample estimates:
## mean in group B mean in group M
## 1.415350 6.560669
In Figure 21, we can see that benign tumors have fairly uniform cell shape having low values, and for malignant tumors, mean value of the attribute is significantly higher (Fig 22. and t-test). Therefore, this attributes shows a similar trend as well.
Let us plot the correlations between various attributes and visualize.
corrMatrix <- bc_transformed %>%
select(-sampleCodeNumber, -class) %>%
cor()
corrplot(corrMatrix, method = "pie", type="lower", title="Fig. 23. Correlation heatmap/matrix", mar=c(0,0,2,0))
hc <- hclust(dist(corrMatrix), method = "complete")
plot(hc, main = "Fig. 24. Dendrogram of attributes", xlab = "Correlation distance", col = "orange", hang = -200, sub="")
# Adding labels to the branches
rect.hclust(hc, k = 3, border = 2:10)
In the correlation plot, we can notice that uniformity in cell shape and size correlate with many of the other variables (first 2 columns in the Fivure). Checking the correlation matrix values, we can also see that they correlate with each other pretty strongly ~0.907, so we must handle this before ML modelling. Clump thickness also correlates with most of the variables (bottom most row). Dendrogram shows us how similar or closely related the attributes are. As also seen in the correlation matrix, uniformity of cell shape and size are closely related. Marginal adhesion, bare nuclei, and bland chromatin shows similar relation in the dendrogram, being more related compared to others. As evident from both corrplot and dendrogram, mitoses has the least correlation with other variables.
Since the dimensionality of our dataset is not very high, and has only 9 attributes, we do not need to do dimension reduction. So we will proceed with making ML models for our data.
For our logistic regression, we will make glm as well as random forest models. Let us first check the correlation of various continuous variables again with graphs
bc_transformed %>%
select(-sampleCodeNumber) %>%
ggpairs(aes(color=class), progress=FALSE)+
labs(title="Fig. 25. Correlations")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Seeing the correlations, we can see that the uniformity in cell shape and size is very strongly correlated, and we should remove one of these variables to proceed with our ML model.
Note, that many of our features were not normally distributed, and cannot be transformed so we will proceed with them and keep this caveat in consideration. Let us now center our data.
db_c <- bc_transformed %>%
select(-uniformityCellShape, -sampleCodeNumber) %>% #removing cell shape column
#this centers the data around the mean by subtracting it from the mean
mutate(across(where(is.numeric), ~.-mean(.)))
set.seed(137)
dim(db_c) #682x9
## [1] 682 9
#let us keep 1/3 data for testing: ~230
test_rows <- sample(1: dim(db_c)[1], 230)
db_test <- db_c[test_rows, ]
db_train <- db_c[-test_rows, ]
#Add Test/Train symbol to original dataset
db_c$Type <- "train"
db_c$Type[test_rows] <- "test"
# head(db_c)
#For continuous variables
db_c %>%
pivot_longer(c(-class, -Type), names_to = "measure", values_to ="values") %>%
ggplot(aes(y=values, x=Type))+
facet_wrap(~measure, scales ="free")+
geom_boxplot()+
labs(title="Fig. 26. Distribution of features in testing and training data.")
#For categorical variables
db_c %>%
ggplot(aes(x=Type, fill=class))+
geom_bar(position="fill")+
labs(subtitle="Fig. 27. Distribution of tumor class in testing and training data",
fill="Classification",
x="Type of data")
In Fig 26., visual inspection shows that the distribution of all the continuous columns in our split dataset looks similar with no significant difference between the means of two classes. Let us do statistical inspection to confirm our observation. Fig 27. confirms that distribution of class column is also similar in both test and training subsets.
Let us do a stat inspection to confirm our visual observations of the split data.
#Factor the Type column
db_c$Type = factor(db_c$Type)
isFactor <- db_c %>%
sapply(is.factor)
#the continuous data variables
apply(db_c[, !isFactor], 2, function(x) t.test(x ~ db_c$Type)$p.value) #none with p<0.05 -> not significant
## uniformityCellSize marginalAdhesion singleEpiCellSize bareNuclei
## 0.4058383 0.4403141 0.5794051 0.3732211
## blandChromatin normalNucleoli mitoses clumpThickness_sq
## 0.8057863 0.7464925 0.6739432 0.1934219
#the categorical variables: class, Type
apply(db_c[, isFactor], 2, function(x) chisq.test(x, db_c$Type)$p.value) #none with p<0.05 -> not significant
## class Type
## 3.003177e-01 2.298030e-149
Our statistical inspection shows us the same thing. We see no significant correlation between our test and training data, which confirms that the variable distributions are similar in both the subsets and our split decision is correct.
We will now proceed with developing our logistic regression model.
class.summary <- function(y, pred, pred.prob){
tab <- table(y=y, pred=pred) # Confusion matrix
acc <- (sum(diag(tab))) / sum(tab) # Accuracy rate
mis <- (sum(tab) - sum(diag(tab)))/sum(tab) # Misclassification Rate
# Deviance
if(any(pred.prob == 0))
pred.prob[pred.prob == 0] <- 0.001
if(any(pred.prob == 1))
pred.prob[pred.prob == 1] <- 1 - 0.001
dev <- -2*sum((y=="S")*log(pred.prob) + (1-(y=="S"))*log(1-pred.prob))
#sensitivity and specificity
#
conf <- tab # extract confusion matrix
sens <- conf[2,2]/sum(conf[2,]) # sensitivity
speci <- conf[1,1]/sum(conf[1,]) # specificity
fp <- 1 - speci # false positive rate
fn <- 1 - sens # false negative rate
return(list(conf.tab=tab, acc=round(acc,3), missclass=round(mis,3), dev=round(dev,3), sens=round(sens,3), speci=round(speci,3), fPos=round(fp,3), fNeg=round(fn,3)))
}
#Inputting all the variables
db.glm <- glm(class ~ ., data = db_train , family = binomial)
summary(db.glm)
##
## Call:
## glm(formula = class ~ ., family = binomial, data = db_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2350 0.4131 -2.990 0.002790 **
## uniformityCellSize 0.3629 0.1969 1.843 0.065286 .
## marginalAdhesion 0.3042 0.1343 2.264 0.023549 *
## singleEpiCellSize 0.1604 0.1762 0.910 0.362605
## bareNuclei 0.3724 0.1038 3.589 0.000332 ***
## blandChromatin 0.3887 0.2105 1.847 0.064750 .
## normalNucleoli 0.1996 0.1302 1.534 0.125094
## mitoses 0.6579 0.3611 1.822 0.068491 .
## clumpThickness_sq 2.1933 0.7388 2.969 0.002989 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 593.26 on 451 degrees of freedom
## Residual deviance: 71.45 on 443 degrees of freedom
## AIC: 89.45
##
## Number of Fisher Scoring iterations: 8
#Testing our model on test data
#Numerical probabilities
db.pred.glm.p <- predict(db.glm, newdata = db_test, type = "response")
hist(db.pred.glm.p, main="Fig. 28. Distribution of probabilities") #histogram shows 2 peaks, let us look into this further
#Check distribution of B and M in test data
db_test %>%
count(class) #almost 67% are benign, and rest are malignant, hence our probability histogram has 2 peaks
## class n
## 1 B 156
## 2 M 74
#Since there a lot more benign in our dataset than malignant tumors, we will keep the probability of malignant a little bit on the lower end, so to reduce the probability of a false negative.
thresh_sick <- 0.4 #decided threshold based on histogram shape
db.pred.glm.fp <- factor(db.pred.glm.p > thresh_sick, labels=c("B", "M"))
#Table for performance parameters
summary.db.pred.glm <- class.summary(y=db_test$class, pred=db.pred.glm.fp, pred.prob = db.pred.glm.p)
#ROC curve
plot_data <- data.frame(D=db_test$class, p=db.pred.glm.p)
#ggplot
ggplot(plot_data, aes(d = D, m = p)) +
geom_roc(n.cuts = 12)+
labs(subtitle="Fig. 29. True positives and false positives")
## Warning in verify_d(data$d): D not labeled 0/1, assuming B = 0 and M = 1!
#AUC
a_glm <- auc(db_test$class, db.pred.glm.p)
## Setting levels: control = B, case = M
## Setting direction: controls < cases
a_glm #99.7%
## Area under the curve: 0.997
lm_test <-c(unlist(summary.db.pred.glm[2:6]),AUC=a_glm)
data.frame(lm_test)
## lm_test
## acc 0.9740000
## missclass 0.0260000
## dev 1013.4380000
## sens 0.9460000
## speci 0.9870000
## AUC 0.9969681
Seeing the performance parameters, it has very high accuracy ~97%, very low missclassification ~0.03, but extremely high deviance ~1000. Area under the curve is ~99% This indicates our model is predicting the class of the tumor correctly most of the times, but when it is mispredicting, it is getting it really wrong such that model fit to the data is extremely deviated. This also reflects the imbalance in our dataset, with only 30% of the data being M class. Let us try to optimize this model.
#Using step model to optimize the model created above
set.seed(137)
db.glm.optimized <- step(db.glm, scope=list(upper=db.glm, lower=~1, direction=c("both")), trace=FALSE)
summary(db.glm.optimized)
##
## Call:
## glm(formula = class ~ uniformityCellSize + marginalAdhesion +
## bareNuclei + blandChromatin + normalNucleoli + mitoses +
## clumpThickness_sq, family = binomial, data = db_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1694 0.4014 -2.914 0.00357 **
## uniformityCellSize 0.4148 0.1946 2.131 0.03307 *
## marginalAdhesion 0.3114 0.1364 2.283 0.02245 *
## bareNuclei 0.3843 0.1030 3.731 0.00019 ***
## blandChromatin 0.3890 0.2100 1.852 0.06398 .
## normalNucleoli 0.2204 0.1321 1.669 0.09508 .
## mitoses 0.6500 0.3540 1.836 0.06633 .
## clumpThickness_sq 2.1276 0.7321 2.906 0.00366 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 593.264 on 451 degrees of freedom
## Residual deviance: 72.289 on 444 degrees of freedom
## AIC: 88.289
##
## Number of Fisher Scoring iterations: 8
#Numerical probabilities
db.pred.glm.optimized.p <- predict(db.glm.optimized, newdata = db_test, type = "response")
hist(db.pred.glm.optimized.p, main="Fig. 30. Distribution of probabilities")
#Factored probabilities
# db.pred.glm.optimized.fp <- factor(ifelse(db.pred.glm.optimized.p > thresh_sick, "H", "S"))
db.pred.glm.optimized.fp <- factor(db.pred.glm.optimized.p > thresh_sick, labels=c("H", "S"))
#Table for performance parameters
summary.db.pred.glm.optimized <- class.summary(y=db_test$class, pred=db.pred.glm.optimized.fp, pred.prob = db.pred.glm.optimized.p)
#ROC curve
plot_data <- data.frame(D=db_test$class, p=db.pred.glm.optimized.p)
#ggplot
ggplot(plot_data, aes(d = D, m = p)) +
geom_roc(n.cuts = 12)+
labs(subtitle="Fig. 31. True positives and false positives")
## Warning in verify_d(data$d): D not labeled 0/1, assuming B = 0 and M = 1!
#AUC
a_op <- auc(db_test$class, db.pred.glm.optimized.p)
## Setting levels: control = B, case = M
## Setting direction: controls < cases
a_op #99.7%
## Area under the curve: 0.9971
lm_optimized <-c(unlist(summary.db.pred.glm.optimized[2:6]),AUC=a_op)
data.frame(lm_test, lm_optimized)
## lm_test lm_optimized
## acc 0.9740000 0.9740000
## missclass 0.0260000 0.0260000
## dev 1013.4380000 1004.5100000
## sens 0.9460000 0.9460000
## speci 0.9870000 0.9870000
## AUC 0.9969681 0.9970547
#To compare 2 models
anova(db.glm.optimized, db.glm, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: class ~ uniformityCellSize + marginalAdhesion + bareNuclei +
## blandChromatin + normalNucleoli + mitoses + clumpThickness_sq
## Model 2: class ~ uniformityCellSize + marginalAdhesion + singleEpiCellSize +
## bareNuclei + blandChromatin + normalNucleoli + mitoses +
## clumpThickness_sq
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 444 72.289
## 2 443 71.450 1 0.83914 0.3596
Looking at the model summary, we can see optimized model excludes single epithalial cell size attribute. Even though the optimization did not improve the model performance much in terms of accuracy, sensitivity, specificity or AUC, but it reduces AUC by 10 units and uses 1 variables less than the unoptimized model. So our optimized model performs similarly and is a bit simpler.
Let us try a random forest to see if we can figure out the variables in terms of their importance.
set.seed(137)
#Make the model using all variables first
db.rf <- randomForest(class~., data=db_train, mtry=5, nodesize=3, importance=T, proximity=T, na.action=na.omit)
summary(db.rf)
## Length Class Mode
## call 8 -none- call
## type 1 -none- character
## predicted 452 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 904 matrix numeric
## oob.times 452 -none- numeric
## classes 2 -none- character
## importance 32 -none- numeric
## importanceSD 24 -none- numeric
## localImportance 0 -none- NULL
## proximity 204304 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 452 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
varImpPlot(db.rf, main="Fig. 32. Estimate of variable importance")
#Lowest contributor is mitoses, and other low contributors are marginal adhesion, single epi cell size.
We tried removing various combinations of these 3 contributors such as mitoses only, marginal adhesion only, single epi only, all 3. (work not shown) The one which gave the least deviation is removal of marginal adhesion, so let us see that.
set.seed(137)
#Let us optimize the model by removing low contributors
db.rf.optimized <- randomForest(class~.-marginalAdhesion, data=db_train, mtry=5, nodeSize =5, importance=T, proximity=T, na.action=na.omit)
summary(db.rf.optimized)
## Length Class Mode
## call 8 -none- call
## type 1 -none- character
## predicted 452 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 904 matrix numeric
## oob.times 452 -none- numeric
## classes 2 -none- character
## importance 28 -none- numeric
## importanceSD 21 -none- numeric
## localImportance 0 -none- NULL
## proximity 204304 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 452 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
#Factored probabilities
db.rf.optimized.fp <- predict(db.rf.optimized, newdata=db_test, type="response", aggregation="average")
#Numerical probabilities
db.rf.optimized.p <- predict(db.rf.optimized, newdata=db_test, type="prob", aggregation="average")[,2]#only need Sick Probabilities
#ROC plot
plot_data_rf <- data.frame(D=db_test$class, p=db.rf.optimized.p)
#ggplot
ggplot(plot_data_rf, aes(d = D, m = p)) +
geom_roc(n.cuts = 12)+
labs(subtitle="Fig. 33. True positives and false positives")
## Warning in verify_d(data$d): D not labeled 0/1, assuming B = 0 and M = 1!
#Summary table
sumry.db.rf.optimized <-class.summary(db_test$class, db.rf.optimized.fp, db.rf.optimized.p) #AUC
a_rf_optimized <- auc(db_test$class, db.rf.optimized.p)
## Setting levels: control = B, case = M
## Setting direction: controls < cases
rf_optimized <- c(unlist(sumry.db.rf.optimized[2:6]),AUC=a_rf_optimized) #81%
data.frame(lm_test,lm_optimized, rf_optimized)
## lm_test lm_optimized rf_optimized
## acc 0.9740000 0.9740000 0.9570000
## missclass 0.0260000 0.0260000 0.0430000
## dev 1013.4380000 1004.5100000 617.9210000
## sens 0.9460000 0.9460000 0.9320000
## speci 0.9870000 0.9870000 0.9680000
## AUC 0.9969681 0.9970547 0.9940229
Random forest model looks much better without compromising a lot on accuracy, and other parameters. It reduces the deviance by almost 50%, bringing it down to 617. Deviance is still very high, so we will try to tweak some parameters to see if we can improve it.
#Let us try different mtry values to optimize
set.seed(137)
db.rf.mod <- randomForest(class~., data=db_train, mtry=4, nodesize=2, importance=T, proximity=T, na.action=na.omit)
db.rf.mod.fp <- predict(db.rf.mod, newdata=db_test, type="response", aggregation="average")
db.rf.mod.p <- predict(db.rf.mod, newdata=db_test, type="prob", aggregation="average")[,2]
#ROC plot
plot_data_rf_mod <- data.frame(D=db_test$class, p=db.rf.mod.p)
#ggplot
ggplot(plot_data_rf_mod, aes(d = D, m = p)) +
geom_roc(n.cuts = 12)+
labs(subtitle="Fig. 34. True positives and false positives")
## Warning in verify_d(data$d): D not labeled 0/1, assuming B = 0 and M = 1!
sumry.db.rf.mod <-class.summary(db_test$class, db.rf.mod.fp, db.rf.mod.p)
a_rf_mod<-auc(db_test$class, db.rf.mod.p)
## Setting levels: control = B, case = M
## Setting direction: controls < cases
rf_mod <- c(unlist(sumry.db.rf.optimized[2:6]),AUC=a_rf_mod)
data.frame(lm_test,lm_optimized, rf_optimized, rf_mod)
## lm_test lm_optimized rf_optimized rf_mod
## acc 0.9740000 0.9740000 0.9570000 0.9570000
## missclass 0.0260000 0.0260000 0.0430000 0.0430000
## dev 1013.4380000 1004.5100000 617.9210000 617.9210000
## sens 0.9460000 0.9460000 0.9320000 0.9320000
## speci 0.9870000 0.9870000 0.9680000 0.9680000
## AUC 0.9969681 0.9970547 0.9940229 0.9946292
Trying different values of mtry doesn’t yield any benefit in terms of deviance. Let us try removing the outliers which we observed in our initial EDA, and see if our ML model shows an improvement
Seeing the Figure 6. for boxplot distributions of all columns before, we can notice that columns mitoses, normalNucleoi, singleEpiCellSize have quite a few outliers. Let us remove these and see if RF model improves.
#we will check whether there are outliers i.e. 3SD away from mean
#returns true when there are outliers, else returns false.
hasOutliers <- function(.data, col){
mn <- mean(col)
sd <- sd(col)
sd_3 <- c(mn-3*sd, mn+3*sd)
# print(paste("Mean", round(mn, 2), "3SD range", round(sd_3[1], 2), round(sd_3[2], 2) ))
#check if the data has values outside 3SD range
if(any(range(.data$normalNucleoli) %][% sd_3)){
return(T)
}
return(F)
}
#Returns a new db with outliers removed
removeOutliers <- function(.data, column){
colValues <- .data[[column]]
if(!hasOutliers(.data, colValues)){
return(.data)
}
# Calculate mean and standard deviation
mean_val <- mean(colValues)
sd_val <- sd(colValues)
# Identify indices of outliers, ones which are greater than 3SD from mean
outlier_indices <- which(abs(colValues - mean_val) > (3 * sd_val))
# print(outlier_indices)
# print(.data[outlier_indices,])
colName <- sym(column)
# Remove outliers
cleaned_data <- .data[-outlier_indices,]
return(cleaned_data)
}
#Mitoses
cleaned_data <- removeOutliers(bc_transformed, cNames$mitoses)
cNames <- createListColnames(cleaned_data) #update name list
#distribution before
p1 <- plotBox(bc_transformed, cNames$mitoses, "Mitoses with outliers")
#distribution after outlier removal
p2 <- plotBox(cleaned_data, cNames$mitoses, "Mitoses without outliers")
g_plot <- ggarrange(p1, p2, nrow=1, ncol=2, common.legend = T, legend = "bottom")
annotate_figure(g_plot, top=text_grob(paste0("Fig. 35. Distribution of mitoses before and after outlier removal")))
# hasOutliers(cleaned_data, cleaned_data$singleEpiCellSize)
temp <- removeOutliers(cleaned_data, cNames$singleEpiCellSize)
p1 <- plotBox(cleaned_data, cNames$singleEpiCellSize, "Single epithelial cell size with outliers")
p2 <- plotBox(temp, cNames$singleEpiCellSize, "Single epithelial cell size without outliers")
g1_plot <- ggarrange(p1, p2, nrow=1, ncol=2)
annotate_figure(g1_plot, top=text_grob(paste0("Fig. 36. Distribution of singleEpi before and after outlier removal")))
cleaned <- temp
cNames <- createListColnames(cleaned)
Seeing the distribution, it changes slightly with the removal of outliers, with the range shrinking and mean shifting up (red crossbar)
hasOutliers(cleaned, cleaned$normalNucleoli) #NONE
## [1] FALSE
hasOutliers(cleaned, cleaned$blandChromatin) #NONE
## [1] FALSE
Graphs after outlier removal shows that the most distant points are now closer to the mean compared to before for Mitoses and Single Epithelial cell size values. There were no values beyond 3SD range for both normal nucleoli and bland chromatin columns. So we will build our RF model again. Let us first recenter our data.
db_c <- cleaned %>%
select(-uniformityCellShape, -sampleCodeNumber) %>%
#this centers the data around the mean by subtracting it from the mean
mutate(across(where(is.numeric), ~.-mean(.)))
set.seed(137)
dim(db_c) #631x9
## [1] 631 9
#let us keep 1/3 data for testing: ~190
test_rows <- sample(1: dim(db_c)[1], 190)
db_test <- db_c[test_rows, ]
db_train <- db_c[-test_rows, ]
#Add Test/Train symbol to original dataset
db_c$Type <- "train"
db_c$Type[test_rows] <- "test"
# head(db_c)
set.seed(137)
#or -> outlier removal
db.rf.or <- randomForest(class~., data=db_train, mtry=5, nodeSize =5, importance=T, proximity=T, na.action=na.omit)
summary(db.rf.or)
## Length Class Mode
## call 8 -none- call
## type 1 -none- character
## predicted 441 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 882 matrix numeric
## oob.times 441 -none- numeric
## classes 2 -none- character
## importance 32 -none- numeric
## importanceSD 24 -none- numeric
## localImportance 0 -none- NULL
## proximity 194481 -none- numeric
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 441 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
#Factored probabilities
db.rf.or.fp <- predict(db.rf.or, newdata=db_test, type="response", aggregation="average")
#Numerical probabilities
db.rf.or.p <- predict(db.rf.or, newdata=db_test, type="prob", aggregation="average")[,2]#only need Sick Probabilities
hist(db.rf.or.p, main="Fig. 37. Histogram of probabilities of being malignant")
#ROC plot
plot_data_rf <- data.frame(D=db_test$class, p=db.rf.or.p)
#ggplot
ggplot(plot_data_rf, aes(d = D, m = p)) +
geom_roc(n.cuts = 12)+
labs(subtitle="Fig. 38. True positives and false positives")
## Warning in verify_d(data$d): D not labeled 0/1, assuming B = 0 and M = 1!
#Summary table
sumry.db.rf.or <-class.summary(db_test$class, db.rf.or.fp, db.rf.or.p) #AUC
a_rf_or <- auc(db_test$class, db.rf.or.p)
## Setting levels: control = B, case = M
## Setting direction: controls < cases
rf_or <- c(unlist(sumry.db.rf.or[2:6]),AUC=a_rf_or) #81%
data.frame(lm_test,lm_optimized, rf_optimized, rf_or)
## lm_test lm_optimized rf_optimized rf_or
## acc 0.9740000 0.9740000 0.9570000 0.9530000
## missclass 0.0260000 0.0260000 0.0430000 0.0470000
## dev 1013.4380000 1004.5100000 617.9210000 411.9880000
## sens 0.9460000 0.9460000 0.9320000 0.9140000
## speci 0.9870000 0.9870000 0.9680000 0.9700000
## AUC 0.9969681 0.9970547 0.9940229 0.9948406
After removing all the outliers, the deviance dropped to ~413 and accuracy remained similar ~95.3%. Therefore, compared to our logistic regression model, this RF model is an improvement in deviance by 60% reduction from 1000 to 412. Hence, we made a well-performing random forest with ~96% accuracy, ~93%+ specificity and sensitivity, AUC of ~99% and deviance of 413 with only 0.04 misclasification rate.
In this project, along with logistic regression and random forest predictions for benign and malignant classes, I also aim to predict the degree of malignancy of a tumor. So we will create a new column which will contain continuous values of degree of malignancy of a tumor.
To calculate this degree of malignancy, I would like to be able to use all the attributes which are contributing to it, i.e. all columns, therefore we will do dimension reduction to achieve this. Let us do a PCA on our dataset.
getLabel <- function(n, variance){
paste0("PC", n,"(", scales::percent(variance), ")")
}
makePCPlots <- function(pca, color, figNum){
pca_x <- as.data.frame(pca$x)
pca_rot <- as.data.frame(pca$rotation)
variance<- summary(pca)$importance["Proportion of Variance", ]
# nPC <- dim(pca_x)[2]
p12 <- ggplot(pca_x, aes(PC1, PC2, color = color))+
geom_point()+
labs(
x = getLabel(1, variance[1]),
y = getLabel(2, variance[2]),
color="Classification")+
theme(legend.position = "top")
p34 <- ggplot(pca_x, aes(PC3, PC4, color = color))+
geom_point()+
labs(
x = getLabel(3, variance[3]),
y = getLabel(4, variance[4]),
color="Classification")+
theme(legend.position = "top")
p56 <- ggplot(pca_x, aes(PC5, PC6, color = color))+
geom_point()+
labs(x = getLabel(5, variance[5]),
y = getLabel(6, variance[6]),
color="Classification")+
theme(legend.position = "top")
p78 <- ggplot(pca_x, aes(PC7, PC8, color = color))+
geom_point()+
labs(x = getLabel(7, variance[7]),
y = getLabel(8, variance[8]),
color="Classification")+
theme(legend.position = "top")
p19 <- ggplot(pca_x, aes(PC2, PC9, color = color))+
geom_point()+
labs(x = getLabel(2, variance[2]),
y = getLabel(9, variance[9]),
color="Classification")+
theme(legend.position = "top")
g2 <- ggarrange(p12, p34, p56, p78, p19, nrow=2, ncol=3, common.legend = T, legend="top")
annotate_figure(g2, top=text_grob(paste0("Fig. ", figNum,". PCA plots.")))
}
numeric_bc <- cleaned %>%
select(-class,-sampleCodeNumber)
pca <- prcomp(numeric_bc, scale. = T)
# summary(pca)
makePCPlots(pca, cleaned$class, 39)
In Figure 39., we can see that PC1 and PC2, explains the maximum variance (64% & 8% resp.) in the dataset. So if we take PC1, PC2 to be our variables of interest, they can explain around 72% of the total variance in our data. So we will calculate the distance of each datapoint in this PC1/PC2 plot to the centroid of the benign cluster. So the farther a point is from that centroid, more chances of it being malignant, hence we can call this distance the degree of malignancy of a tumor. To find the benign cluster, we can look at the PC1/PC2 plot and notice, PC1 > -1.25 is mostly benign cluster. So let us consider that our threshold for getting all the data points of benign tumors and find its centroid.
#This function plots a big dot corresponding to the centroid of a cluster on a given pca
plotCentroidOnPCA <- function(pca, color, centroid, thresh, figNum ){
# color <- bc_transformed$class
pca_x <- as.data.frame(pca$x)
variance<- summary(pca)$importance["Proportion of Variance", ]
pca_matrix <- pca$x[, 1:2]
# malignant_cluster <- as.data.frame(pca_matrix) %>% filter(PC1 < thresh)
#
# centroid <- colMeans(malignant_cluster)
centroid_df <- data.frame(PC1=centroid[["PC1"]], PC2= centroid[["PC2"]])
ggplot(pca_x, aes(PC1, PC2, color=color))+
geom_point()+
geom_point(data=centroid_df, aes(x=PC1, y=PC2, color="Centroid(Benign)"), size=5)+
labs(
subtitle=paste0("Fig. ", figNum,". PCA plot with centroid of benign cluster."),
x = getLabel(1, variance[1]),
y = getLabel(2, variance[2]),
color="Classification")+
theme(legend.position = "top")
}
#Using PC1 and PC2 to be our variables of interest
pca_matrix <- pca$x[, 1:2]
# centroid <- colMeans(matrix_data) #centroid of the entire dataset
#Since datapoints in PC1/PC2 plot at PC1 > -1.25 mostly form the benign tumor cluster
threshold_for_benign <- -1.25
benign_cluster <- as.data.frame(pca_matrix) %>% filter(PC1 > threshold_for_benign)
centroid <- colMeans(benign_cluster) # centroid of malignant cluster, for points PC1<-1.25
plotCentroidOnPCA(pca, cleaned$class, centroid, threshold_for_benign, 40) #to see the location of centroid on PCA plot
# distances <- round(sqrt(rowSums((matrix_data - centroid)^2)), 2)
distances <- round(sqrt(rowSums(t(t(pca_matrix) - centroid)^2)),2)
# length(distances)
# print(distances)
cleaned$degreeOfMalignancy <- distances
cNames <- createListColnames(cleaned)
Let us examine the trends in this new column
#Check number of B and M tumors
cleaned %>%
select(class) %>%
count(class) #M: 191, B:440
## class n
## 1 B 440
## 2 M 191
#Range of degree of malignancy
range(cleaned$degreeOfMalignancy) #0.00 9.97
## [1] 0.01 9.88
#How many malignant cancers with high degree of malignancy
cleaned %>%
select(class, degreeOfMalignancy) %>%
filter(class=='M') %>%
arrange(degreeOfMalignancy) %>%
#just a view at how many malignant cancers have higher degree of malignancy
count(degreeOfMalignancy>=2) #True for 184 out of 191
## degreeOfMalignancy >= 2 n
## 1 FALSE 7
## 2 TRUE 184
cleaned %>%
select(class, degreeOfMalignancy) %>%
arrange(degreeOfMalignancy) %>%
filter(class== "B") %>% #max value of B tumor's malignancy is is 5.21 which is ok
count(degreeOfMalignancy<2) #Most benign tumors, 425/440 are under 2 d.o.M (degree of malignancy)
## degreeOfMalignancy < 2 n
## 1 FALSE 15
## 2 TRUE 425
# cleaned %>%
# select(class, degreeOfMalignancy) %>%
# arrange(degreeOfMalignancy) %>%
# filter(class== "M") #range of values: 0.4 to 9.97 for M tumors
#Let us plot the distribution of degree of malignancy over class
plotDistributionWithClass(cleaned, cNames$degreeOfMalignancy, "Degree of Malignancy", 41)
plotDensity(cleaned, cNames$degreeOfMalignancy, "Degree of Malignancy", 42)
Our analysis of d.o.M’s density plot in Figure 42, shows us an expected trend, with many benign tumors in the dataset having low values of d.o.M, whereas, malignant tumors take a wide range of values from 0.4 to 9.9. Having maximum benign tumors with really low d.o.M values is a good indicator of the 2 classes based on our degree of malignancy calculations.
We will now make a linear regression model for the degree of malignancy prediction.
Let us first check how does the linear relationship of all the variables with our outcome variable i.e. degree of malignancy looks.
checkLinearity <- function(data, input, output, figNum){
ggplot(data, aes(x=.data[[input]], y =.data[[output]])) +
geom_point() +
# the linear fit line
geom_smooth (method = "lm", aes(colour = "linear")) +
#the nonlinear fit line (loess)
geom_smooth(aes(colour = "loess")) +
scale_colour_manual("Line fit", breaks =c("linear", "loess"), values = c("blue", "lightgreen"))+
labs(subtitle=paste0("Fig. ", figNum,". Degree of malignancy vs ", input))
}
#Plotting in grid doesn't give a clear picture, so we will plot them individually
p1 <-checkLinearity(cleaned, cNames$bareNuclei, cNames$degreeOfMalignancy, 43) #OK
p2 <-checkLinearity(cleaned, cNames$singleEpiCellSize, cNames$degreeOfMalignancy, 44) #looks bad
ggarrange(p1, p2, nrow=2, ncol=1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 0.96
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.04
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 5.2696e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.96
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.04
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 5.2696e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1
p3 <- checkLinearity(cleaned, cNames$uniformityCellSize, cNames$degreeOfMalignancy, 45) #looks OK
p4 <- checkLinearity(cleaned, cNames$blandChromatin, cNames$degreeOfMalignancy, 46) #looks much better
ggarrange(p3, p4, nrow=2, ncol=1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
p5 <- checkLinearity(cleaned, cNames$marginalAdhesion, cNames$degreeOfMalignancy, 47)
p6 <- checkLinearity(cleaned, cNames$normalNucleoli, cNames$degreeOfMalignancy, 48)
ggarrange(p5, p6, nrow=2, ncol=1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 0.955
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.045
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 1.4521e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.955
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.045
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 1.4521e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 0.955
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.045
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.955
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.045
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
p7 <- checkLinearity(cleaned, cNames$mitoses, cNames$degreeOfMalignancy, 49) #worst, no greenline
p8 <- checkLinearity(cleaned, cNames$clumpThickness_sq, cNames$degreeOfMalignancy,50) #much better!
ggarrange(p7, p8, nrow=2, ncol=1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : at 0.975
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : radius 0.000625
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 0.975
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`
## Caused by error in `predLoess()`:
## ! NA/NaN/Inf in foreign function call (arg 5)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#clump thickness, uniformity in cell Size, and bland chromatin can be selected since they show fairly linear relationship to our outcome variable.
Looking at Figures 45, 46 and 50, we can say that clump thickness, uniformity in cell Size, and bland chromatin can be selected since they show fairly linear relationship to our outcome variable due to similar variation above and below the linear fit line. In addition to this, we can notice in Fig. 32 (varImpPlot) that bareNuclei was the most important variable. So let us keep that in too, even though its relationship with the outcome variable is not as linear like the other 3 variables. Let us ensure none of these 4 variables are strongly correlated
cor(cleaned$uniformityCellSize, cleaned$clumpThickness_sq) #0.58
## [1] 0.587579
cor(cleaned$blandChromatin, cleaned$uniformityCellSize) #0.74
## [1] 0.7454574
cor(cleaned$blandChromatin, cleaned$clumpThickness_sq) #0.51
## [1] 0.5159256
cor(cleaned$blandChromatin, cleaned$bareNuclei) #0.66
## [1] 0.6665117
cor(cleaned$clumpThickness_sq, cleaned$bareNuclei) #0.55
## [1] 0.5515372
cor(cleaned$uniformityCellSize, cleaned$bareNuclei) #0.68
## [1] 0.6825034
None of the correlations are too strong and hence we can use these 4 input variables to construct our linear model.
checkHomoscedasticity <- function(residuals, fitValues, figNum){
dataH <- data.frame(resid = residuals, fitVal = fitValues)
ggplot(dataH, aes(x= fitVal, y=resid))+
geom_point()+
geom_smooth(method="lm", aes(colour = "linear"))+
geom_smooth(method="loess", aes(colour = "loess"))+
geom_hline(aes(yintercept = 0), color="red")+
scale_colour_manual("Line fit", breaks =c("linear", "loess"), values = c("blue", "lightgreen") )+
labs(subtitle=paste0("Fig. ", figNum,". Linear and polynomial fit of the model"))
}
linear_model <- lm(degreeOfMalignancy~ bareNuclei*clumpThickness_sq*blandChromatin*uniformityCellSize, data=cleaned)
summary(linear_model)
##
## Call:
## lm(formula = degreeOfMalignancy ~ bareNuclei * clumpThickness_sq *
## blandChromatin * uniformityCellSize, data = cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8820 -0.2587 -0.0361 0.1329 4.6729
##
## Coefficients:
## Estimate
## (Intercept) 3.002126
## bareNuclei -0.886691
## clumpThickness_sq -1.678959
## blandChromatin -0.847245
## uniformityCellSize -0.718608
## bareNuclei:clumpThickness_sq 0.463670
## bareNuclei:blandChromatin 0.214850
## clumpThickness_sq:blandChromatin 0.469789
## bareNuclei:uniformityCellSize 0.271347
## clumpThickness_sq:uniformityCellSize 0.484498
## blandChromatin:uniformityCellSize 0.179808
## bareNuclei:clumpThickness_sq:blandChromatin -0.089873
## bareNuclei:clumpThickness_sq:uniformityCellSize -0.113229
## bareNuclei:blandChromatin:uniformityCellSize -0.043072
## clumpThickness_sq:blandChromatin:uniformityCellSize -0.074947
## bareNuclei:clumpThickness_sq:blandChromatin:uniformityCellSize 0.017490
## Std. Error
## (Intercept) 0.386222
## bareNuclei 0.137065
## clumpThickness_sq 0.187202
## blandChromatin 0.130970
## uniformityCellSize 0.204581
## bareNuclei:clumpThickness_sq 0.058013
## bareNuclei:blandChromatin 0.031437
## clumpThickness_sq:blandChromatin 0.065933
## bareNuclei:uniformityCellSize 0.036091
## clumpThickness_sq:uniformityCellSize 0.079145
## blandChromatin:uniformityCellSize 0.034423
## bareNuclei:clumpThickness_sq:blandChromatin 0.013055
## bareNuclei:clumpThickness_sq:uniformityCellSize 0.014120
## bareNuclei:blandChromatin:uniformityCellSize 0.006392
## clumpThickness_sq:blandChromatin:uniformityCellSize 0.013642
## bareNuclei:clumpThickness_sq:blandChromatin:uniformityCellSize 0.002462
## t value Pr(>|t|)
## (Intercept) 7.773 3.23e-14
## bareNuclei -6.469 2.01e-10
## clumpThickness_sq -8.969 < 2e-16
## blandChromatin -6.469 2.01e-10
## uniformityCellSize -3.513 0.000476
## bareNuclei:clumpThickness_sq 7.992 6.56e-15
## bareNuclei:blandChromatin 6.834 1.99e-11
## clumpThickness_sq:blandChromatin 7.125 2.92e-12
## bareNuclei:uniformityCellSize 7.518 1.97e-13
## clumpThickness_sq:uniformityCellSize 6.122 1.65e-09
## blandChromatin:uniformityCellSize 5.223 2.41e-07
## bareNuclei:clumpThickness_sq:blandChromatin -6.884 1.44e-11
## bareNuclei:clumpThickness_sq:uniformityCellSize -8.019 5.39e-15
## bareNuclei:blandChromatin:uniformityCellSize -6.739 3.68e-11
## clumpThickness_sq:blandChromatin:uniformityCellSize -5.494 5.77e-08
## bareNuclei:clumpThickness_sq:blandChromatin:uniformityCellSize 7.105 3.34e-12
##
## (Intercept) ***
## bareNuclei ***
## clumpThickness_sq ***
## blandChromatin ***
## uniformityCellSize ***
## bareNuclei:clumpThickness_sq ***
## bareNuclei:blandChromatin ***
## clumpThickness_sq:blandChromatin ***
## bareNuclei:uniformityCellSize ***
## clumpThickness_sq:uniformityCellSize ***
## blandChromatin:uniformityCellSize ***
## bareNuclei:clumpThickness_sq:blandChromatin ***
## bareNuclei:clumpThickness_sq:uniformityCellSize ***
## bareNuclei:blandChromatin:uniformityCellSize ***
## clumpThickness_sq:blandChromatin:uniformityCellSize ***
## bareNuclei:clumpThickness_sq:blandChromatin:uniformityCellSize ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6361 on 615 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9229
## F-statistic: 504.1 on 15 and 615 DF, p-value: < 2.2e-16
Note: Just 3 input variables (without bare nuclei) was tried too. But including bare nuclei as an input variable increased Adjusted-R squared from 88 to 92%, reduced RSE from 0.78 to 0.63 and all coefficients became significant, so let us keep all 4 variables in.
Summary of the model shows that adjusted R-square is 0.922, which means that 92.2% of the variance in the dataset is explained by our linear model, which is good. RSE is only 0.63 and all coefficients are significant. p-value <0.05 tells us that overall the model is significant. F ~500 is large which is good and shows that our model is a good fit and is significant. Let us check the homoscedasticity and vif score of our model.
checkHomoscedasticity(linear_model$residuals, linear_model$fitted.values, 51)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Our model overall shows a good linear fit due to nearly equal up/down variations around the linear line.
#if high, fix by standardizing the data.
vif(linear_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## bareNuclei
## 358.73614
## clumpThickness_sq
## 24.50473
## blandChromatin
## 139.60412
## uniformityCellSize
## 505.01785
## bareNuclei:clumpThickness_sq
## 494.14785
## bareNuclei:blandChromatin
## 907.35518
## clumpThickness_sq:blandChromatin
## 309.58518
## bareNuclei:uniformityCellSize
## 1375.79745
## clumpThickness_sq:uniformityCellSize
## 620.75021
## blandChromatin:uniformityCellSize
## 863.12118
## bareNuclei:clumpThickness_sq:blandChromatin
## 1135.67915
## bareNuclei:clumpThickness_sq:uniformityCellSize
## 1503.76419
## bareNuclei:blandChromatin:uniformityCellSize
## 2116.31013
## clumpThickness_sq:blandChromatin:uniformityCellSize
## 1010.02569
## bareNuclei:clumpThickness_sq:blandChromatin:uniformityCellSize
## 2242.08558
All vif scores are greater than 5, and very high, with highest being 341, so let us try to standardize the input variables.
db_std <- cleaned %>%
select(clumpThickness_sq, blandChromatin, uniformityCellSize, bareNuclei, degreeOfMalignancy) %>%
mutate(stClumpThickness = clumpThickness_sq - mean(clumpThickness_sq), stBlandChromatin = log2(blandChromatin) - mean(log2(blandChromatin)), stUniformityCellSize = log2(uniformityCellSize)- mean(log2(uniformityCellSize)) , stBareNuclei = log2(bareNuclei) - mean(log2(bareNuclei)) )
lm_centered <- lm(degreeOfMalignancy~ stBareNuclei*stClumpThickness*stUniformityCellSize*stBlandChromatin, data=db_std)
summary(lm_centered)
##
## Call:
## lm(formula = degreeOfMalignancy ~ stBareNuclei * stClumpThickness *
## stUniformityCellSize * stBlandChromatin, data = db_std)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9878 -0.1942 -0.0588 0.1198 4.5633
##
## Coefficients:
## Estimate
## (Intercept) 1.168685
## stBareNuclei 0.412688
## stClumpThickness 0.466088
## stUniformityCellSize 0.789112
## stBlandChromatin 0.335310
## stBareNuclei:stClumpThickness 0.040434
## stBareNuclei:stUniformityCellSize 0.202337
## stClumpThickness:stUniformityCellSize 0.332995
## stBareNuclei:stBlandChromatin 0.018084
## stClumpThickness:stBlandChromatin 0.091229
## stUniformityCellSize:stBlandChromatin 0.342782
## stBareNuclei:stClumpThickness:stUniformityCellSize -0.355272
## stBareNuclei:stClumpThickness:stBlandChromatin -0.032274
## stBareNuclei:stUniformityCellSize:stBlandChromatin -0.055790
## stClumpThickness:stUniformityCellSize:stBlandChromatin -0.001354
## stBareNuclei:stClumpThickness:stUniformityCellSize:stBlandChromatin 0.108153
## Std. Error
## (Intercept) 0.051219
## stBareNuclei 0.044279
## stClumpThickness 0.079777
## stUniformityCellSize 0.051894
## stBlandChromatin 0.052709
## stBareNuclei:stClumpThickness 0.061764
## stBareNuclei:stUniformityCellSize 0.041011
## stClumpThickness:stUniformityCellSize 0.073001
## stBareNuclei:stBlandChromatin 0.042946
## stClumpThickness:stBlandChromatin 0.075712
## stUniformityCellSize:stBlandChromatin 0.045463
## stBareNuclei:stClumpThickness:stUniformityCellSize 0.055419
## stBareNuclei:stClumpThickness:stBlandChromatin 0.056231
## stBareNuclei:stUniformityCellSize:stBlandChromatin 0.033648
## stClumpThickness:stUniformityCellSize:stBlandChromatin 0.060022
## stBareNuclei:stClumpThickness:stUniformityCellSize:stBlandChromatin 0.043733
## t value
## (Intercept) 22.817
## stBareNuclei 9.320
## stClumpThickness 5.842
## stUniformityCellSize 15.206
## stBlandChromatin 6.361
## stBareNuclei:stClumpThickness 0.655
## stBareNuclei:stUniformityCellSize 4.934
## stClumpThickness:stUniformityCellSize 4.561
## stBareNuclei:stBlandChromatin 0.421
## stClumpThickness:stBlandChromatin 1.205
## stUniformityCellSize:stBlandChromatin 7.540
## stBareNuclei:stClumpThickness:stUniformityCellSize -6.411
## stBareNuclei:stClumpThickness:stBlandChromatin -0.574
## stBareNuclei:stUniformityCellSize:stBlandChromatin -1.658
## stClumpThickness:stUniformityCellSize:stBlandChromatin -0.023
## stBareNuclei:stClumpThickness:stUniformityCellSize:stBlandChromatin 2.473
## Pr(>|t|)
## (Intercept) < 2e-16
## stBareNuclei < 2e-16
## stClumpThickness 8.35e-09
## stUniformityCellSize < 2e-16
## stBlandChromatin 3.90e-10
## stBareNuclei:stClumpThickness 0.5129
## stBareNuclei:stUniformityCellSize 1.04e-06
## stClumpThickness:stUniformityCellSize 6.13e-06
## stBareNuclei:stBlandChromatin 0.6738
## stClumpThickness:stBlandChromatin 0.2287
## stUniformityCellSize:stBlandChromatin 1.70e-13
## stBareNuclei:stClumpThickness:stUniformityCellSize 2.89e-10
## stBareNuclei:stClumpThickness:stBlandChromatin 0.5662
## stBareNuclei:stUniformityCellSize:stBlandChromatin 0.0978
## stClumpThickness:stUniformityCellSize:stBlandChromatin 0.9820
## stBareNuclei:stClumpThickness:stUniformityCellSize:stBlandChromatin 0.0137
##
## (Intercept) ***
## stBareNuclei ***
## stClumpThickness ***
## stUniformityCellSize ***
## stBlandChromatin ***
## stBareNuclei:stClumpThickness
## stBareNuclei:stUniformityCellSize ***
## stClumpThickness:stUniformityCellSize ***
## stBareNuclei:stBlandChromatin
## stClumpThickness:stBlandChromatin
## stUniformityCellSize:stBlandChromatin ***
## stBareNuclei:stClumpThickness:stUniformityCellSize ***
## stBareNuclei:stClumpThickness:stBlandChromatin
## stBareNuclei:stUniformityCellSize:stBlandChromatin .
## stClumpThickness:stUniformityCellSize:stBlandChromatin
## stBareNuclei:stClumpThickness:stUniformityCellSize:stBlandChromatin *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6562 on 615 degrees of freedom
## Multiple R-squared: 0.92, Adjusted R-squared: 0.918
## F-statistic: 471.3 on 15 and 615 DF, p-value: < 2.2e-16
vif(lm_centered) #some are between 5-10
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## stBareNuclei
## 5.335170
## stClumpThickness
## 4.182346
## stUniformityCellSize
## 5.672741
## stBlandChromatin
## 3.931246
## stBareNuclei:stClumpThickness
## 4.379642
## stBareNuclei:stUniformityCellSize
## 5.931829
## stClumpThickness:stUniformityCellSize
## 4.457618
## stBareNuclei:stBlandChromatin
## 4.344102
## stClumpThickness:stBlandChromatin
## 3.422137
## stUniformityCellSize:stBlandChromatin
## 4.008463
## stBareNuclei:stClumpThickness:stUniformityCellSize
## 10.612624
## stBareNuclei:stClumpThickness:stBlandChromatin
## 6.466664
## stBareNuclei:stUniformityCellSize:stBlandChromatin
## 10.050919
## stClumpThickness:stUniformityCellSize:stBlandChromatin
## 6.333793
## stBareNuclei:stClumpThickness:stUniformityCellSize:stBlandChromatin
## 10.759931
checkHomoscedasticity(lm_centered$residuals, lm_centered$fitted.values, 52)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
We can see that some vif scores are between 5-10 which indicates
moderate correlation and not all coefficients are significant. On
removing clump thickness from the model, all vif scores drop below 5,
adjusted-r squared reduces from 91.8% to 90.4% which is a small decrease
and all coefficients still remain significant. So we will remove clump
thickness from the final model.
lm_centered <- lm(degreeOfMalignancy~ stBareNuclei*stUniformityCellSize*stBlandChromatin, data=db_std)
summary(lm_centered)
##
## Call:
## lm(formula = degreeOfMalignancy ~ stBareNuclei * stUniformityCellSize *
## stBlandChromatin, data = db_std)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1689 -0.2656 -0.0682 0.2244 4.6518
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.26553 0.04725 26.784
## stBareNuclei 0.34996 0.03806 9.195
## stUniformityCellSize 0.89483 0.04543 19.696
## stBlandChromatin 0.46253 0.04605 10.045
## stBareNuclei:stUniformityCellSize 0.16193 0.03341 4.847
## stBareNuclei:stBlandChromatin 0.08390 0.03674 2.284
## stUniformityCellSize:stBlandChromatin 0.45853 0.03767 12.172
## stBareNuclei:stUniformityCellSize:stBlandChromatin -0.12853 0.02654 -4.843
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## stBareNuclei < 2e-16 ***
## stUniformityCellSize < 2e-16 ***
## stBlandChromatin < 2e-16 ***
## stBareNuclei:stUniformityCellSize 1.59e-06 ***
## stBareNuclei:stBlandChromatin 0.0227 *
## stUniformityCellSize:stBlandChromatin < 2e-16 ***
## stBareNuclei:stUniformityCellSize:stBlandChromatin 1.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7069 on 623 degrees of freedom
## Multiple R-squared: 0.9059, Adjusted R-squared: 0.9048
## F-statistic: 856.9 on 7 and 623 DF, p-value: < 2.2e-16
vif(lm_centered) #all of them are under 5 now.
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## stBareNuclei
## 3.396741
## stUniformityCellSize
## 3.746223
## stBlandChromatin
## 2.585277
## stBareNuclei:stUniformityCellSize
## 3.392606
## stBareNuclei:stBlandChromatin
## 2.739625
## stUniformityCellSize:stBlandChromatin
## 2.371341
## stBareNuclei:stUniformityCellSize:stBlandChromatin
## 5.386947
checkHomoscedasticity(lm_centered$residuals, lm_centered$fitted.values, 53)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Now all the vif scores are under 5 and variance explained is ~91%. Also,
note that our linear model has improved with a tighter fit around the
linear line after we standardized the data.
#Let us split the data into testing and training sets
dim(db_std) #631x7
## [1] 631 9
set.seed(137)
#divide into 1/3 as testing data (210), 2/3rd as training data
test_rows <- sample(1: dim(db_std)[1], 210)
db_test_lm <- db_std[test_rows, ]
db_train_lm <- db_std[-test_rows, ]
#Add Test/Train symbol to original dataset
db_std$Type <- "train"
db_std$Type[test_rows] <- "test"
lm_train <- lm(degreeOfMalignancy~ stBareNuclei*stUniformityCellSize*stBlandChromatin, data=db_train_lm)
summary(lm_train)
##
## Call:
## lm(formula = degreeOfMalignancy ~ stBareNuclei * stUniformityCellSize *
## stBlandChromatin, data = db_train_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0148 -0.2840 -0.0111 0.2334 4.6060
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.30076 0.05505 23.630
## stBareNuclei 0.36087 0.04546 7.938
## stUniformityCellSize 0.99201 0.05172 19.180
## stBlandChromatin 0.37290 0.05324 7.004
## stBareNuclei:stUniformityCellSize 0.19350 0.03980 4.862
## stBareNuclei:stBlandChromatin 0.05914 0.04388 1.348
## stUniformityCellSize:stBlandChromatin 0.41980 0.04230 9.925
## stBareNuclei:stUniformityCellSize:stBlandChromatin -0.16242 0.03211 -5.058
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## stBareNuclei 1.96e-14 ***
## stUniformityCellSize < 2e-16 ***
## stBlandChromatin 1.01e-11 ***
## stBareNuclei:stUniformityCellSize 1.65e-06 ***
## stBareNuclei:stBlandChromatin 0.178
## stUniformityCellSize:stBlandChromatin < 2e-16 ***
## stBareNuclei:stUniformityCellSize:stBlandChromatin 6.38e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6769 on 413 degrees of freedom
## Multiple R-squared: 0.9107, Adjusted R-squared: 0.9092
## F-statistic: 601.5 on 7 and 413 DF, p-value: < 2.2e-16
vif(lm_train) #all of them are under 5
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## stBareNuclei
## 3.450087
## stUniformityCellSize
## 3.502714
## stBlandChromatin
## 2.513785
## stBareNuclei:stUniformityCellSize
## 3.349779
## stBareNuclei:stBlandChromatin
## 2.790691
## stUniformityCellSize:stBlandChromatin
## 2.215007
## stBareNuclei:stUniformityCellSize:stBlandChromatin
## 5.387711
checkHomoscedasticity(lm_train$residuals, lm_train$fitted.values, 54) #shows homoscedasticity still
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Our linear model explains ~90% of the variance in our dataset (Adjusted R-squared 0.902), and shows an RSE of only 0.67. This indicates that our predictions deviate by only by 0.67 units on an average from the true regression line. Most of the coefficients of the model are significant. With p being <0.05, overall our model is significant and shows good performance.
Let us now do predictions on the test data using our linear model
pred_dom <- predict.lm(lm_train, newdata = db_test_lm)
#Plotting the expected/actual and predicted degree of malignancy values
ggplot(db_test_lm, aes(x = db_test_lm$degreeOfMalignancy , y = pred_dom)) +
geom_point(show.legend = T) +
geom_smooth(method = "lm", aes(color = "linear")) +
geom_smooth(method = "loess", aes(color = "loess")) +
xlab("Actual") +
ylab("Predicted")+
geom_point(aes(y = pred_dom), color = "red", shape = 16, alpha = 0.7)+
scale_colour_manual("Line fit", breaks =c("linear", "loess"), values = c("blue", "lightgreen") )+
labs(subtitle="Fig. 55. Linear regression model predictions for degree of malignancy")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
As we can see from the graph, our predictions are fairly similar to the actual values of degree of malignancy. It is more accurate for lower values, and diverges a bit for higher degree of malignancy values. That implies our model is really good at predicting benign and moderately malignant tumors, but might lead to some deviation when it comes to extremely malignant tumors with very high degree of malignancy. We must emphasize that it could be due to input variables distribution not being entirely normal, so we might need more normalized data inputs. Therefore, we can say that our linear model performs well for the most part and predicts the degree of malignancy in the correct trend as expected.
Hence, using the machine learning analysis, we were able to achieve a 96% accurate random forest and a well performing linear model explaining 89% variance in our dataset which enabled us to efficiently predict both the class and degree of malignancy of a tumor respectively. We also need to keep in mind that most of our variables were not normally distributed so our models may not be entirely accurate and might need more normally distributed data points to get a more accurate picture of the predictions.